Introduction to R

Session 2

Session Overview

  1. Objects
  2. Vectors
  3. Matrices
  4. Lists
  5. Data frames

R Training Team Today

Martin Schumann

  • Assistant Professor at QE
  • Research interests: panel data, nonlinear models, difference-in-differences, network data, innovation
  • Website
  • Sessions 1 and 2

Stephan Smeekes

  • Professor of Econometrics at QE
  • Research interests: econometrics, time series, high-dimensional statistics, bootstrap, macro- and climate econometrics
  • Website
  • Sessions 2, 3 and 4

Objects

  • In R, everything is an object.

  • Objects have a name that is assigned with <- (recommended) or =.

  • Names have to start with a letter and can contain only letters, numbers, “.” and “_”.

  • R is case sensitive: \(\Rightarrow Name\neq name\)!

  • Objects can store vectors, matrices, lists, data frames, functions…

# generate object x (no output):
x <- 5
# display log(x)
log(x)
[1] 1.609438
# object X is not defined => error message 
X
Error: object 'X' not found
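
The naming rules above can be sketched as follows. The object names (my_value, my.value, Name, name) are hypothetical and chosen only for illustration.

```r
# both assignment operators create an object
my_value <- 10   # recommended
my.value = 20    # also works
# names are case sensitive: 'Name' and 'name' are two different objects
Name <- "upper case"
name <- "lower case"
Name == name         # FALSE
# exists() checks whether an object with a given name is defined
exists("my_value")   # TRUE
```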

Vectors

  • Vectors can store different types of information (e.g., numbers or characters), but all elements of one vector have the same type.
  • To define a 3-dimensional vector named “vec”, use vec <- c(value1, value2, value3).
  • Operators and functions can be applied to vectors, which means they are applied to each of the elements individually.
# define vector named 'vec'
vec <- c(1, 2, 3)
# take the square root of 'vec' and store the result in 'sqrt_vec'
sqrt_vec <- sqrt(vec)
# display sqrt_vec
print(sqrt_vec)
[1] 1.000000 1.414214 1.732051
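
As a small illustration of element-wise evaluation, the same idea applies to arithmetic operators. This is only a sketch reusing the vector from above.

```r
vec <- c(1, 2, 3)
vec * 2        # each element is multiplied by 2: 2 4 6
vec + vec      # element-wise sum: 2 4 6
vec^2          # element-wise square: 1 4 9
length(vec)    # number of elements: 3
```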

Vectors - some helpful shortcuts

  • R has built-in functions that generate sequences (useful for loops or plots, among other things).
  • We can also repeat elements using rep().
# generate sequence 5,6,...,10
5:10
[1]  5  6  7  8  9 10
# generate sequence from 1 to 5 in steps of 0.5
seq(from = 1, to = 5, by = 0.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
# generate 4-dimensional vector of ones
rep(1, 4)
[1] 1 1 1 1
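
A few further variants of these shortcuts, as a sketch. The object name grid is hypothetical; length.out and times are standard arguments of seq() and rep().

```r
# 5 equally spaced points between 0 and 1
grid <- seq(from = 0, to = 1, length.out = 5)
grid                      # 0.00 0.25 0.50 0.75 1.00
# repeat the whole vector (1, 2) three times
rep(c(1, 2), times = 3)   # 1 2 1 2 1 2
# seq_len(n) gives 1, 2, ..., n
seq_len(4)                # 1 2 3 4
```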

Order matters!

  • Be aware of the order of operations!
  • Compare the following:
1+2:3^2 # '^2' evaluated before ':', only then '+1' is evaluated
[1]  3  4  5  6  7  8  9 10
1+2:3*4 # first ':', then '*4', then '+1'
[1]  9 13
# use brackets to avoid confusion or mistakes
(1+2):(3*4)
 [1]  3  4  5  6  7  8  9 10 11 12

Summarizing vectors

  • R has built-in functions to summarize the information stored in vectors.
  • Remark: R is very good at generating random numbers! (Such functions are studied in more detail in the next session.)
# Example: generate 100 random draws from a normal distribution with mean 1
# and standard deviation 2
norm.vec <- rnorm(n = 100, mean = 1, sd = 2)
# get mean 
mean(norm.vec)
[1] 0.767044
# get standard deviation 
sd(norm.vec)
[1] 2.111653
# get maximum
max(norm.vec)
[1] 5.737621
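
Because the draws are random, the numbers above change with every run. A minimal sketch of how to make them reproducible with set.seed() follows. The seed value 123 and the names draws1 and draws2 are arbitrary choices for illustration.

```r
set.seed(123)                        # arbitrary seed, assumed for illustration
draws1 <- rnorm(n = 100, mean = 1, sd = 2)
set.seed(123)                        # resetting the seed reproduces the draws
draws2 <- rnorm(n = 100, mean = 1, sd = 2)
identical(draws1, draws2)            # TRUE
# several summary statistics at once (minimum, quartiles, mean, maximum)
summary(draws1)
```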

Exercise 2.1: generating and summarizing vectors

  • draw 50 random numbers from a normal distribution with mean 0 and variance 1. Store your results in the object norm.vec.
  • calculate the mean and standard deviation of norm.vec.
  • use rep() to repeat each element of norm.vec 3 times. Store the result in the object norm.vec.rep.
  • Is mean(norm.vec.rep^2) equal to mean(norm.vec.rep)^2?

Logical operators

  • Logical values can be either TRUE or FALSE.
  • Extremely useful for conditional statements, e.g. if(condition is TRUE){do this}else{do that}.
  • We can check if two objects are equal with ==, different with !=, or compare them with < and >.
  • We can combine logical statements with “AND” (&) and “OR” (|).
# define  objects 
obj1 <- 1
obj2 <- 2
obj3 <- 1 # same value as obj1
obj1 == obj2 # false statement
[1] FALSE
obj1 != obj2 # true statement
[1] TRUE
obj1 == obj2 & obj1 == obj3 # FALSE AND TRUE => FALSE
[1] FALSE
obj1 == obj2 | obj1 == obj3 # FALSE OR TRUE => TRUE
[1] TRUE
  • We can also use logical operators in vectors.
  • The AND and OR operators & and | are then applied element-wise.
vec2 <- 1:5 # defines vector vec2=(1,2,3,4,5)
vec2 == 3  # =FALSE if element is not 3, =TRUE if element is 3
[1] FALSE FALSE  TRUE FALSE FALSE
vec2 >= 2 & vec2 < 5 # Only TRUE for elements >=2 and <5
[1] FALSE  TRUE  TRUE  TRUE FALSE
vec2 >= 2 | vec2 < 5 # TRUE for all elements since either >=2 or <5
[1] TRUE TRUE TRUE TRUE TRUE
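
The conditional statement mentioned above, and the use of logical vectors to select elements, can be sketched as follows. The object name msg and the threshold values are hypothetical.

```r
vec2 <- 1:5
# conditional statement: the condition must be a single TRUE or FALSE
if (sum(vec2) > 10) {
  msg <- "sum is larger than 10"
} else {
  msg <- "sum is at most 10"
}
msg                          # "sum is larger than 10", since 1+2+3+4+5 = 15
# a logical vector can be used to select elements
vec2[vec2 >= 2 & vec2 < 5]   # 2 3 4
# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUEs
sum(vec2 > 3)                # 2
```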

Characters

  • Vectors can also store characters.
  • Characters are enclosed in "" or ''.
# define a vector of 3 cities
cities <- c('Maastricht', "Amsterdam", 'Rotterdam')
print(cities)
[1] "Maastricht" "Amsterdam"  "Rotterdam" 
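
A few base R functions that work on character vectors, shown as a small sketch. The object name city_labels is hypothetical.

```r
cities <- c("Maastricht", "Amsterdam", "Rotterdam")
nchar(cities)                       # number of characters: 10 9 9
toupper(cities)                     # "MAASTRICHT" "AMSTERDAM" "ROTTERDAM"
# paste() glues character strings together element by element
city_labels <- paste("City:", cities)
city_labels[1]                      # "City: Maastricht"
```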

Exercise 2.2: type coercion

  • R tries to make objects comparable by coercing one object into the type of another.
  • This can sometimes be handy, but sometimes it leads to unforeseen errors (e.g., when loading new data). To illustrate this, do the following:
    • compare the character "1" to the numeric 1.
    • try computing the sum of "1" and "2".
    • try computing the sum of as.numeric("1") and as.numeric("2"). What happened?
    • create a mixed vector containing the numeric 1 and the character "2". Of which type are the elements of the vector?

Factors

  • Many variables are qualitative rather than quantitative.
  • While they are often coded using numbers, they don’t have a numerical meaning.
  • Examples: gender, nationality…
  • Can also be ordinal, i.e., the outcomes can be ranked (e.g., “bad”, “meh”, “great”).
x <- c(1, 3, 3, 2, 1, 3)
xf <- factor(x, labels = c("bad", "ok", "good"))# no ranking
xf
[1] bad  good good ok   bad  good
Levels: bad ok good
# now with ranking
xf.ordered <- factor(x, labels = c("bad", "ok", "good"), ordered = TRUE)
xf.ordered
[1] bad  good good ok   bad  good
Levels: bad < ok < good
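
To inspect a factor, the following functions may be helpful. This is a sketch that reuses the ordered factor defined above.

```r
x <- c(1, 3, 3, 2, 1, 3)
xf.ordered <- factor(x, labels = c("bad", "ok", "good"), ordered = TRUE)
levels(xf.ordered)          # "bad" "ok" "good"
table(xf.ordered)           # counts per level: bad 2, ok 1, good 3
# with an ordered factor, comparisons between levels are meaningful
xf.ordered >= "ok"          # FALSE TRUE TRUE TRUE FALSE TRUE
# the underlying integer codes
as.integer(xf.ordered)      # 1 3 3 2 1 3
```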

Names

  • You can give the elements of your vector names either directly or using the names() command.
  • This is very useful for accessing elements (see next slide).
avg_temp <- c(Maastricht = 14.2, Amsterdam = 13.4, Rotterdam = 13.7)
print(avg_temp) # names appear on top of elements
Maastricht  Amsterdam  Rotterdam 
      14.2       13.4       13.7 
names(avg_temp) # returns names of elements
[1] "Maastricht" "Amsterdam"  "Rotterdam" 
# Alternatively, we can define data and names separately
temp <- c(14.2, 13.4, 13.7)
names(temp) <- cities # recall that we have defined "cities" earlier!
print(temp)
Maastricht  Amsterdam  Rotterdam 
      14.2       13.4       13.7 

Accessing elements

  • One can access the elements of a vector either by name or position.
# return the second element of "avg_temp" defined before
avg_temp[2] 
Amsterdam 
     13.4 
# return the element corresponding to "Maastricht"
avg_temp["Maastricht"]
Maastricht 
      14.2 
# trying to access a non-existing element yields "NA"
# ( for "not available"), i.e. a missing value
avg_temp[4]
<NA> 
  NA 
  • By using the minus sign [-k], we can get the vector except for the \(k\)-th element.
  • We can also add elements to an existing vector.
# get the vector except for the third element
avg_temp[-3]
Maastricht  Amsterdam 
      14.2       13.4 
# now add another city to avg_temp
avg_temp["Tilburg"] <- 14.7
# now the fourth element is defined!
avg_temp[4]
Tilburg 
   14.7 
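
Several elements can be accessed at once by passing a vector of positions, names, or logical values. A minimal sketch, redefining avg_temp so that the example runs on its own:

```r
avg_temp <- c(Maastricht = 14.2, Amsterdam = 13.4, Rotterdam = 13.7)
avg_temp[c(1, 3)]                       # first and third element
avg_temp[c("Amsterdam", "Rotterdam")]   # selection by several names
avg_temp[avg_temp > 13.5]               # elements above 13.5: Maastricht, Rotterdam
avg_temp[["Maastricht"]]                # [[ ]] drops the name: 14.2
```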

More on NA, NaN, Inf

  • NA (“not available”) indicates missing values.
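  • Arithmetic operations involving NA generally yield NA.
  • NaN (“not a number”) indicates the result of a mathematically undefined operation.
  • Inf and -Inf indicate positive and negative infinity, e.g., the result of dividing a nonzero number by 0.
# define another vector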
vec3 <- c(-1.2, NA, 0)
# combine avg_temp and vec3
vec4 <- c(avg_temp, vec3) 
# divide elements by 0; notice the different outcomes
vec4 / 0 
Maastricht  Amsterdam  Rotterdam    Tilburg                                  
       Inf        Inf        Inf        Inf       -Inf         NA        NaN 
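
Missing values can be detected with is.na() and ignored in many summary functions via the na.rm argument. A short sketch using vec3 from above:

```r
vec3 <- c(-1.2, NA, 0)
is.na(vec3)                 # FALSE TRUE FALSE
mean(vec3)                  # NA, because one element is missing
mean(vec3, na.rm = TRUE)    # -0.6, the mean of -1.2 and 0
# NaN arises from undefined operations, Inf from dividing a nonzero number by 0
0 / 0                       # NaN
1 / 0                       # Inf
is.nan(c(0 / 0, NA))        # TRUE FALSE: NA is not NaN
is.na(c(0 / 0, NA))         # TRUE TRUE: NaN also counts as missing
```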

Matrices

  • We can create a matrix with m rows directly using matrix(vector,nrow=m).
# create matrix with 3 rows; fill numbers by row
mat1 <- matrix(1:12, nrow = 3, byrow = TRUE) # by default, R fills matrices by column
mat1
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
  • We can also combine vectors of the same length by row with rbind(v1,v2,...) or by column with cbind(v1,v2,...).
# create vectors v1, v2 and v3 and combine them for same result
v1 <- 1:4
v2 <- 5:8
v3 <- 9:12
mat2 <- rbind(v1, v2, v3)
mat2
   [,1] [,2] [,3] [,4]
v1    1    2    3    4
v2    5    6    7    8
v3    9   10   11   12
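
For comparison, a sketch of the default column-wise filling and of cbind(). The object names mat_col and mat3 are hypothetical.

```r
# without byrow = TRUE, the numbers are filled column by column
mat_col <- matrix(1:12, nrow = 3)
mat_col[1, ]                # first row: 1 4 7 10
dim(mat_col)                # 3 rows, 4 columns
# cbind() uses the vectors as columns instead of rows
v1 <- 1:4
v2 <- 5:8
v3 <- 9:12
mat3 <- cbind(v1, v2, v3)
dim(mat3)                   # 4 rows, 3 columns
# mat3 is the transpose of rbind(v1, v2, v3)
all(mat3 == t(rbind(v1, v2, v3)))   # TRUE
```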

Matrix indexing

# assign names to columns
colnames(mat1) <- c("col1", "col2", "col3", "col4")
mat1
     col1 col2 col3 col4
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
# assign names to rows
rownames(mat1) <- c("row1","row2","row3")
mat1
     col1 col2 col3 col4
row1    1    2    3    4
row2    5    6    7    8
row3    9   10   11   12

Accessing elements

  • We can access single elements by [rownumber,colnumber], the k-th row by [k,] and the k-th column by [,k].
# get element in second row in third column
mat1[2,3]
[1] 7
# get second row
mat1[2,]
col1 col2 col3 col4 
   5    6    7    8 
# get third column
mat1[,3]
row1 row2 row3 
   3    7   11 
  • If rows/columns have names, we can also use those.
  • Using vectors, we can also create more complicated subsets of matrices.
# get sub-matrix using vectors
mat1[c(2,3),c(1:3)]
     col1 col2 col3
row2    5    6    7
row3    9   10   11
# get second row using names (recall definition of mat2)
mat2["v2",]
[1] 5 6 7 8
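
Names and negative indices work for matrices as well. A minimal sketch, redefining mat1 so that the example runs on its own. The drop = FALSE argument is shown as an optional extra.

```r
mat1 <- matrix(1:12, nrow = 3, byrow = TRUE)
colnames(mat1) <- c("col1", "col2", "col3", "col4")
rownames(mat1) <- c("row1", "row2", "row3")
mat1["row2", "col3"]          # 7, same as mat1[2, 3]
mat1[-1, c("col1", "col4")]   # drop the first row, keep columns 1 and 4
# a single row is returned as a vector; drop = FALSE keeps it a matrix
dim(mat1[2, ])                # NULL
dim(mat1[2, , drop = FALSE])  # 1 4
```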

Exercise 2.3: creating matrices, accessing elements

  1. Create the 3x3 identity matrix “by hand”. To do so:
    1. create 3 vectors with zeros and ones in the appropriate spots.
    2. use rbind() or cbind() to combine them into the identity matrix.
    3. store the identity matrix as the object “I_mat”.
    4. R makes your life easy: type diag(3) in your console.
  2. Replicate the following Excel-matrix:


  3. Get the data for April and May by
    • including only the first and second row
    • excluding the third row
    • using the names

Bonus: Matrix algebra

  • R can do “regular” matrix algebra, and even lets you do operations that are not well-defined mathematically.

  • t(A) is the transpose of the matrix A.

# define matrix containing normal data
data.vec <- rnorm(9, mean = 0, sd = 1)
A <- matrix(data.vec, nrow = 3)
A # return A
          [,1]       [,2]       [,3]
[1,]  1.520550  0.7848033  0.2458953
[2,] -1.347184 -1.9176178 -0.7027760
[3,] -0.205259  0.6375446 -0.1060013
t(A) # return the transpose
          [,1]      [,2]       [,3]
[1,] 1.5205499 -1.347184 -0.2052590
[2,] 0.7848033 -1.917618  0.6375446
[3,] 0.2458953 -0.702776 -0.1060013
  • solve(A) returns the inverse of an invertible matrix.
solve(A) # return the inverse of A
             [,1]       [,2]       [,3]
[1,]  0.952893916  0.3510648 -0.1170526
[2,]  0.002118166 -0.1619678  1.0787407
[3,] -1.832426436 -1.6539502 -2.7191039
  • * does element-wise multiplication.
  • %*% does matrix multiplication.
# element-wise multiplication
A * solve(A) # NOT the identity matrix
             [,1]       [,2]        [,3]
[1,]  1.448922752  0.2755168 -0.02878268
[2,] -0.002853559  0.3105924 -0.75811305
[3,]  0.376122022 -1.0544670  0.28822861
# matrix multiplication
A %*% solve(A) 
              [,1]          [,2]         [,3]
[1,]  1.000000e+00  0.000000e+00 0.000000e+00
[2,]  0.000000e+00  1.000000e+00 2.220446e-16
[3,] -8.326673e-17 -5.551115e-17 1.000000e+00
# yields the identity (up to a small error due to the
# numerical computation of the inverse)
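
To check such results up to numerical error, all.equal() can be used instead of ==. The sketch below also shows solve(A, b) for solving a linear system. The seed and the vector b are arbitrary choices for illustration.

```r
set.seed(1)                         # arbitrary seed, assumed for illustration
A <- matrix(rnorm(9), nrow = 3)
# all.equal() compares up to a small numerical tolerance
all.equal(A %*% solve(A), diag(3))  # TRUE
# solve(A, b) solves the linear system A x = b without forming the inverse
b <- c(1, 2, 3)
x <- solve(A, b)
all.equal(as.vector(A %*% x), b)    # TRUE
```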

Lists

  • A list is a generic collection of objects.
  • Unlike vectors, the components can have different types (e.g., numeric and character).
  • Many functions output lists, so knowing how to access elements is very useful.
  • Generate a list with mylist <- list(name1 = component1, name2 = component2, ...).
mylist <- list(num.vec = 1:3, city = "Maastricht") 
print(mylist)
$num.vec
[1] 1 2 3

$city
[1] "Maastricht"
  • Get names of the components with names(mylist).
  • You can access components with the $ (dollar sign) operator, e.g., mylist$name1, or by position with [[]].
mylist$city
[1] "Maastricht"
mylist[[2]] # same result
[1] "Maastricht"
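
The difference between single and double brackets, and how to add a component, can be sketched as follows. The component name country is hypothetical.

```r
mylist <- list(num.vec = 1:3, city = "Maastricht")
names(mylist)               # "num.vec" "city"
mylist[["num.vec"]]         # component by name: 1 2 3
# single brackets return a (shorter) list, double brackets the component itself
class(mylist[1])            # "list"
class(mylist[[1]])          # "integer"
# add a new component
mylist$country <- "Netherlands"
length(mylist)              # 3
```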

Data frames

  • Data frames are simply data sets in R terminology.
  • So-called data files can contain multiple data sets.
  • We can create a data frame by data.frame() or transform a matrix mat into a data frame by as.data.frame(mat).
  • Many functions (e.g. lm() for regressions) need a data frame as input (see later sessions).
# generate a data frame
ID <- 1:4
hourly_wage <- rnorm(n = 4, mean = 20, sd = 1) # create 4 draws from N(20,1)
city <- c("Maastricht", "Eindhoven", "Amsterdam", NA)
dats <- data.frame(ID, hourly_wage, city) # combine the variables into a data frame
dats
  ID hourly_wage       city
1  1    20.43265 Maastricht
2  2    21.24930  Eindhoven
3  3    19.53891  Amsterdam
4  4    19.18459       <NA>
  • As with lists, we can access variables using the $ operator.
  • We can also add new variables using the $ operator.
  • View() opens a data-viewer. Very useful (but difficult to demonstrate on these slides).
dats$city # "city" is NA for ID 4.
[1] "Maastricht" "Eindhoven"  "Amsterdam"  NA          
dats$city[4] <- 'Tilburg' # assign city to ID 4
dats$educ <- c(12, 21, 9, 10)
dats
  ID hourly_wage       city educ
1  1    20.43265 Maastricht   12
2  2    21.24930  Eindhoven   21
3  3    19.53891  Amsterdam    9
4  4    19.18459    Tilburg   10
  • Using subset(data_frame,condition), we can easily get a subset of the original data frame where condition is TRUE.
# only keep individuals with more than 10 years of education
sub_dats <- subset(dats, educ > 10)
sub_dats
  ID hourly_wage       city educ
1  1    20.43265 Maastricht   12
2  2    21.24930  Eindhoven   21
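
Data frames can also be inspected and subset with matrix-style indexing. The sketch below uses a simplified version of dats without the random wage variable so that the results are deterministic. The names sub1 and sub2 are hypothetical.

```r
dats <- data.frame(ID = 1:4,
                   city = c("Maastricht", "Eindhoven", "Amsterdam", "Tilburg"),
                   educ = c(12, 21, 9, 10))
str(dats)                           # compact overview of variables and types
nrow(dats)                          # 4 observations
# rows and columns can be selected as in a matrix
dats[dats$educ > 10, c("ID", "city")]
# bracket indexing selects the same rows as subset()
sub1 <- dats[dats$educ > 10, ]
sub2 <- subset(dats, educ > 10)
all(sub1$ID == sub2$ID)             # TRUE
```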

Exercise 2.4

  • Create your own data frame:
    • create a vector ID that contains the sequence 1,2,…,100.
    • create a vector income that contains 100 random draws from N(10,1).
    • create a dummy female that is 1 for ID=1,...,50 and 0 otherwise. (hint: you can achieve this by using rep() twice and combining two vectors with c())
    • collect the variables in a data frame my_df.
    • inspect your data with View(my_df)
    • bonus: create a subset sub_my_df that contains only individuals with income larger than 10.

Bonus: teaching regression with R

  • To give students a feeling for the behavior of the least squares estimator, it can be very useful to use simulated data.
  • This allows teachers to visualize the effects of various quantities of interest, e.g., sample size, variation in the observed and unobserved variables, or omitted variables.
n <- 100 # set the sample size
X <- rnorm(n, mean = 1, sd = 2) # define the observed covariate X
epsilon <- rnorm(n, mean = 0, sd = 1) # define the model error
beta0 <- 1 # define true intercept
beta1 <- 2 # define true slope
Y <- beta0 + beta1 * X + epsilon # generate Y according to a linear model
# recall the formula in a bivariate model
beta1.hat <- cov(X,Y) / var(X)
beta0.hat <- mean(Y) - beta1.hat * mean(X)
# print estimators
beta0.hat
[1] 1.044901
beta1.hat
[1] 1.944837
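
As a cross-check, the same estimates can be obtained with lm(), which is covered in later sessions. This is a minimal sketch; the seed and the object name fit are arbitrary choices for illustration.

```r
set.seed(42)                          # arbitrary seed, assumed for illustration
n <- 100
X <- rnorm(n, mean = 1, sd = 2)
epsilon <- rnorm(n, mean = 0, sd = 1)
Y <- 1 + 2 * X + epsilon
# estimates from the explicit formulas
beta1.hat <- cov(X, Y) / var(X)
beta0.hat <- mean(Y) - beta1.hat * mean(X)
# estimates from lm(); coef() extracts the intercept and the slope
fit <- lm(Y ~ X)
coef(fit)
# both approaches agree up to numerical precision
all.equal(unname(coef(fit)), c(beta0.hat, beta1.hat))   # TRUE
```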

Bonus exercise: simulate!

  • Repeat the previous simulation, but
    • change the sample size
    • change the mean of X. What is the effect on beta1.hat?
    • change the mean of epsilon. What is the effect on beta0.hat?
    • change the variance of X and epsilon. What is the effect?